Skip to content
This repository was archived by the owner on May 14, 2026. It is now read-only.

fix: httpjson callables to trace attempts (started, failed)#3300

Merged
diegomarquezp merged 30 commits into
mainfrom
fix-httpjson-retry-metrics
Nov 13, 2024
Merged

fix: httpjson callables to trace attempts (started, failed)#3300
diegomarquezp merged 30 commits into
mainfrom
fix-httpjson-retry-metrics

Conversation

@diegomarquezp

@diegomarquezp diegomarquezp commented Oct 21, 2024

Copy link
Copy Markdown
Contributor

Fixes #2503

Approach

This PR uses the approach suggested by the same bug.

We should move creating the TracedUnaryCallable to the last step for HttpJson. This would require refactoring this file: https://github.com/googleapis/sdk-platform-java/blob/7902a41c87240d607179d07c28cce462ea135c5f/gax-java/gax-httpjson/src/main/java/com/google/api/gax/httpjson/HttpJsonCallableFactory.java

Confirmation that it works

The confirmation that it works is in the modified tests (e.g. retry()). When using the proposed test version against the version of HttpJsonCallableFactory from the main branch, it will fail because it will not record attempts started or failed whatsoever:

image

The image above shows all tests having failed when using the production files from main, implying that no tracer attempts are recorded for any of the tests.

@product-auto-label product-auto-label Bot added the size: s Pull request size is small. label Oct 21, 2024
@product-auto-label product-auto-label Bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Oct 21, 2024
Comment thread gax-java/gax-httpjson/src/test/java/com/google/api/gax/httpjson/RetryingTest.java Outdated
@lqiu96

lqiu96 commented Oct 22, 2024

Copy link
Copy Markdown
Member

I think you can also remove a bunch of the @Disabled annotations in the showcase ITs as well.

@Disabled("https://github.com/googleapis/sdk-platform-java/issues/2503")

I think they should pass with your PR.

Comment thread gax-java/gax-httpjson/src/test/java/com/google/api/gax/httpjson/RetryingTest.java Outdated

@lqiu96 lqiu96 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try and re-enable the Disabled ITOtelMetric tests and see if it works? LGTM otherwise

Edit: Spoke too soon.

@diegomarquezp diegomarquezp marked this pull request as ready for review October 25, 2024 23:32
Comment on lines +350 to +351
// the number of attempts varies. Here we just make sure that all of them except the last are
// considered as failed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq, isn't the last one also considered a failed attempt? Why is the tracerAttemptsFailed != tracerAttempts?

Also, can we add the operationfailed == true check here as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the tracerAttemptsFailed != tracerAttempts?

I'm not sure and I agree they should be the same value since this test just fetches an exception until it can't retry anymore (retries exhausted).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from discussion with @blakeli0, @lqiu96 and @zhumin8, let's double check TestApiTracer, confirm with showcase.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

showcase has another tracer we can use to cross check TestApiTracer

@lqiu96 lqiu96 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Added a few questions/ nits

*/
public class TestApiTracer extends BaseApiTracer {

private final AtomicInteger tracerAttempts;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they can be called just attempts, same for the fields below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to attemptsStarted, removed the tracer prefix from all fields

tracerOperationFailed,
tracerFailedRetriesExhausted);

public AtomicInteger getTracerAttempts() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to get the tracer instance, and use the getters from the tracer in unit tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* exhausted.
*/
public class TestApiTracerFactory implements ApiTracerFactory {
private final AtomicInteger tracerAttempts = new AtomicInteger();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be initialized within the tracer for better encapsulation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to initialize them in the tracer itself. However the clients only expect tracer factories, so there is no way to have control over how the actual tracers are instantiated.
I'll double check on this.

clientContext =
ClientContext.newBuilder()
// we use a custom tracer to confirm whether the retrials are being recorded.
.setTracerFactory(tracerFactory)
.setExecutor(executor)
.setClock(fakeClock)
.setDefaultCallContext(HttpJsonCallContext.createDefault())
.build();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nvm, I realized that accessing the ApiTracer instance should be enough.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

@Override
public void attemptStarted(int attemptNumber) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of a test tracer in general. On the other hand, we should only implement what is needed for our use cases, so that we have lower chance to run into issues that are not related to our PR. For example, maybe we don't need to implement attemptFailedRetriesExhausted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lqiu96 figured that the retries exhausted event records the missing failed attempt in our tests:

It looks like attempt_count is recorded when this is called:

metricsRecorder.recordAttemptCount(1, attributes);
. This seems like the number recorded in Otel should be correct.
I'm guessing your TestApiTracer would need to increment the attemptCount when attemptFailedRetriesExhausted is called as well to match behavior.

I'll keep this implementation and its retries exhausted flag.

* Test tracer that keeps count of different events. See {@link TestApiTracerFactory} for more
* details.
*/
public class TestApiTracer extends BaseApiTracer {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the interface ApiTracer instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@lqiu96

lqiu96 commented Nov 6, 2024

Copy link
Copy Markdown
Member

FYI @diegomarquezp
CI error:

Error:  Failures: 
Error:    RetryingTest.retryKeepFailing:360 expected to be true

@sonarqubecloud

Copy link
Copy Markdown

@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Passed Quality Gate passed for 'java_showcase_integration_tests'

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

@blakeli0 blakeli0 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please make sure @lqiu96 has no other concerns before merging.

@diegomarquezp

Copy link
Copy Markdown
Contributor Author

From @lqiu96 (thanks!):

I have a suspicion as to why the race condition exists. I took a look into the code and I think I understand what you were explaining in the call.
My suspicion of the cause of the race condition:


This sets the result to the future which frees up the thread that is running the test. (basically .get() no longer blocks and if the thread schedules the test to run, it will). The tracerFinisher code which sets the operation failure status runs after this

This means that the tests removed in 065b01f were actually expected behavior and is due to the way we set up the tests.

@diegomarquezp diegomarquezp merged commit 15a64ee into main Nov 13, 2024
@diegomarquezp diegomarquezp deleted the fix-httpjson-retry-metrics branch November 13, 2024 23:29
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

size: m Pull request size is medium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HttpJson Implementation is recording Operation(Success|Fail) metric before retries are attempted

3 participants